feat: detect and extract json from log-lines #1248
base: main
Conversation
Walkthrough
This pull request adds a new directory copy in two Dockerfiles to include the resources directory.
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/handlers/http/ingest.rs (1)
69-73: Introducing `extract_log` from headers
Extracting the optional `extract_log` from request headers yields more flexibility to process inline logs. Consider bounding or validating the header string length if logs are large.

src/event/format/known_schema.rs (1)
38-103: `SchemaDefinition` with `check_or_extract`
This method systematically checks for existing fields before performing regex extraction. It's efficient, but consider providing more detailed error info for partial matches or missing groups.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- Dockerfile (1 hunks)
- Dockerfile.debug (1 hunks)
- Dockerfile.kafka (1 hunks)
- resources/formats.json (1 hunks)
- src/event/format/known_schema.rs (1 hunks)
- src/event/format/mod.rs (1 hunks)
- src/handlers/http/ingest.rs (9 hunks)
- src/handlers/http/modal/utils/ingest_utils.rs (3 hunks)
- src/handlers/mod.rs (1 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
src/event/format/known_schema.rs (1)
- src/event/format/mod.rs (1): `new` (105:110)
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: coverage
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
🔇 Additional comments (22)
Dockerfile.kafka (1)
45-45: LGTM - Adding resources directory to Docker image
This change ensures the `resources` directory is available in the Kafka-enabled Docker image, which is needed for the new JSON extraction feature.

Dockerfile.debug (1)
32-32: LGTM - Adding resources directory to debug Docker image
This change ensures the `resources` directory is available in the debug Docker image, consistent with changes to other Dockerfiles.

Dockerfile (1)
33-33: LGTM - Adding resources directory to main Docker image
This change ensures the `resources` directory is available in the main Docker image, consistent with changes to other Dockerfiles.

src/handlers/mod.rs (1)
25-25: New header constant for log extraction feature
This constant defines a header key that will be used to control the JSON extraction functionality from HTTP requests. The naming convention is consistent with other header constants in this file.
You might want to verify all places where this constant is being used:

```bash
#!/bin/bash
# Find all usages of the new EXTRACT_LOG_KEY constant
rg "EXTRACT_LOG_KEY" --type rust
```
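As a rough sketch of how such a constant is typically wired up in the handler (the helper function below is hypothetical and only for illustration; the header value `"x-p-extract-log"` is the one reported for `src/handlers/mod.rs` later in this review):

```rust
use actix_web::HttpRequest;

// Header key controlling inline log extraction.
pub const EXTRACT_LOG_KEY: &str = "x-p-extract-log";

// Hypothetical helper: read the optional field name from the request headers,
// falling back to None when the header is absent or not valid UTF-8.
fn extract_log_field(req: &HttpRequest) -> Option<&str> {
    req.headers()
        .get(EXTRACT_LOG_KEY)
        .and_then(|value| value.to_str().ok())
}
```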
src/event/format/mod.rs (1)
42-42: New module for JSON extraction from log lines
This addition exposes the new `known_schema` module which implements the core functionality for detecting and extracting JSON data from log lines. Ensure the module implementation includes proper error handling and testing:

```bash
#!/bin/bash
# Check if tests are included for the new module
fd "known_schema" --type file | grep test
```

resources/formats.json (1)
1-554: ❓ Verification inconclusive
Check for potential catastrophic backtracking in regex patterns
All patterns appear well-defined, but indefinite quantifiers such as `.*` may lead to performance overhead or catastrophic backtracking in certain edge cases. It's recommended to test or verify them under load to ensure performance is acceptable, especially if logs are large or patterns are processed frequently.

🏁 Script executed:

```bash
#!/bin/bash
# Searching for suspicious indefinite quantifiers in patterns.
rg -A2 '\.\*' resources/formats.json
```

Length of output: 28115
Performance Alert: Validate Regex Load Handling in formats.json
The regex patterns include several instances of unbounded quantifiers (e.g., `(?<body>.*)`). While these are consistently anchored (typically at the end of each pattern), which generally minimizes backtracking risks, their behavior under heavy or unusually large log entries remains uncertain. Please consider stress-testing these patterns with real-world log loads to confirm that they do not incur performance issues or catastrophic backtracking in production.
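A minimal way to get a feel for this, sketched under the assumption that the server matches these patterns with Rust's `regex` crate (which compiles patterns to finite automata and so cannot backtrack catastrophically, though very large inputs can still be slow); the pattern and input below are invented for the example:

```rust
use regex::Regex;
use std::time::Instant;

fn main() {
    // Hypothetical pattern in the style of formats.json entries (named groups, trailing body)
    let re = Regex::new(r"^(?P<timestamp>\S+) (?P<level>\w+) (?P<body>.*)$").unwrap();

    // Simulate an unusually large log line
    let line = format!("2024-01-01T00:00:00Z INFO {}", "x".repeat(1_000_000));

    let start = Instant::now();
    let matched = re.is_match(&line);
    println!("matched={matched} in {:?}", start.elapsed());
}
```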
src/handlers/http/modal/utils/ingest_utils.rs (3)
26-26: Import of `KNOWN_SCHEMA_LIST`
No issues with the new import. It consolidates all known schemas in a single place, which is a good approach for maintainability.

38-41: New parameter `extract_log`
These changes introduce an `Option<&str>` to accommodate the extraction of raw log data from a specific field. This approach is flexible and aligns well with optional header usage.

72-76: Custom log extraction branch
Calling `KNOWN_SCHEMA_LIST.extract_from_inline_log` for `LogSource::Custom` is a solid extension. Any malformed or unknown format can gracefully fall back with the `_` arm.

src/handlers/http/ingest.rs (7)
31-33: Importing `Unacceptable` and new header keys
Bringing in `Unacceptable` for error handling and the `EXTRACT_LOG_KEY` constant is clear. This decouples the logic for recognized known formats from the rest of the code.

90-90: Invoking `flatten_and_push_logs`
Passing `extract_log` ensures custom inline log extraction for the default ingest path. Implementation is consistent with the newly introduced logic.

153-153: No extraction for OTEL paths
These lines pass `None` to `flatten_and_push_logs`, skipping inline log extraction. This is correct for OTEL ingestion flows.
Also applies to: 191-191, 231-231

273-277: `extract_log` usage in `post_event`
Mirroring the earlier approach for `ingest`, which keeps the logic consistent across ingestion endpoints.

285-285: Invoking `flatten_and_push_logs`
Same approach as in `ingest`, passing the optional `extract_log`. The code remains consistent.

351-353: Introducing `KnownFormat` variant
Attaching `Unacceptable` to `PostError` as `KnownFormat` clarifies error handling for unrecognized or invalid logs. This approach fosters clarity in user-facing error messages.

379-379: HTTP status for KnownFormat errors
Mapping `KnownFormat` to `BAD_REQUEST` is appropriate, as the log data doesn't match a recognized pattern. Ensure detailed messages guide users to correct their log format.

src/event/format/known_schema.rs (6)
1-17: License header updated
No issues found.

19-33: Use statements and static initialization
Declaring a lazy static `KNOWN_SCHEMA_LIST` ensures we only parse and compile the known formats once, improving performance. The approach is straightforward and maintainable.

34-36: Custom error `Unacceptable`
Having a specialized error type clarifies when the log content fails known format checks.

105-125: Data structures for known formats
Using `Format` and `Pattern` to store multiple regex definitions per format is flexible. The `EventProcessor` organizes them well.

127-202: Implementation of `EventProcessor::new` and `extract_from_inline_log`
Compiling regex patterns once ensures we avoid performance overhead at runtime. The approach to skipping unknown formats with a warning is practical, though consider if a fallback or partial extraction is desired.

205-474: Comprehensive test suite
The tests cover multiple scenarios, including edge cases like missing fields and partial matches. This broad coverage is commendable, ensuring confidence in the new extraction logic.
Actionable comments posted: 1
🧹 Nitpick comments (3)
src/event/format/known_schema.rs (3)
126-159: Consider handling regex compilation errors more robustly.
Currently, regex patterns that fail to compile are silently ignored with only a log message. This could lead to situations where expected patterns are not applied without a clear indication to the user.
Consider tracking failed patterns and potentially surfacing this information through a method that can report on schema health, such as:

```diff
 pub struct EventProcessor {
     /// Map of format names to their corresponding schema definitions
     pub schema_definitions: HashMap<String, SchemaDefinition>,
+    /// Patterns that failed to compile
+    compilation_errors: HashMap<String, Vec<String>>,
 }

 impl EventProcessor {
     /// Parses given formats from JSON text and stores them in-memory
     fn new(json_text: &str) -> Self {
         let mut processor = EventProcessor {
             schema_definitions: HashMap::new(),
+            compilation_errors: HashMap::new(),
         };

         // ... existing code ...

         if let Some(pattern) = regex.pattern.and_then(|pattern| {
             Regex::new(&pattern)
                 .inspect_err(|err| {
                     error!("Error compiling regex pattern: {err}; Pattern: {pattern}")
+                    processor.compilation_errors
+                        .entry(format.name.clone())
+                        .or_insert_with(Vec::new)
+                        .push(format!("{}: {}", pattern, err));
                 })
                 .ok()
         }) {
             schema.patterns.push(pattern);
         }

         processor
     }
+
+    /// Returns information about patterns that failed to compile
+    pub fn get_compilation_errors(&self) -> &HashMap<String, Vec<String>> {
+        &self.compilation_errors
+    }
 }
```
168-171: Incorrect error description comment.
The documentation comment for the `Err(Unacceptable)` return value incorrectly states "JSON provided is acceptable for the known format" when it should indicate the opposite.

```diff
 /// # Returns
 /// * `Ok` - The original JSON will now contain extracted fields
-/// * `Err(Unacceptable)` - JSON provided is acceptable for the known format
+/// * `Err(Unacceptable)` - JSON provided is not acceptable for the known format
```
145-154: Consider verifying if any patterns were successfully compiled.
If all regex patterns for a format fail to compile, you'll end up with a schema that has field mappings but no patterns. This could lead to silent failures where extraction is attempted but can never succeed.
Consider adding a validation step after processing all patterns:

```diff
 for format in formats {
+    let mut has_valid_patterns = false;
     for regex in format.regex {
         let schema = processor
             .schema_definitions
             .entry(format.name.clone())
             .or_insert_with(SchemaDefinition::default);

         schema.field_mappings.push(regex.fields.clone());

         // Compile the regex pattern if present
         // NOTE: we only warn if the pattern doesn't compile
         if let Some(pattern) = regex.pattern.and_then(|pattern| {
             Regex::new(&pattern)
                 .inspect_err(|err| {
                     error!("Error compiling regex pattern: {err}; Pattern: {pattern}")
                 })
                 .ok()
         }) {
             schema.patterns.push(pattern);
+            has_valid_patterns = true;
         }
     }
+
+    // If no patterns were valid, log a warning
+    if !has_valid_patterns && processor.schema_definitions.contains_key(&format.name) {
+        warn!("Format '{}' has no valid regex patterns, extraction will always fail", format.name);
+    }
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- src/event/format/known_schema.rs (1 hunks)
- src/event/format/mod.rs (2 hunks)
- src/handlers/http/ingest.rs (9 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/event/format/mod.rs
🧰 Additional context used
🧬 Code Definitions (1)
src/event/format/known_schema.rs (1)
- src/event/format/mod.rs (1): `new` (106-111)
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: coverage
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
🔇 Additional comments (11)
src/handlers/http/ingest.rs (8)
31-33: Imports added for new JSON extraction functionality.
The imports of `Unacceptable` and `EXTRACT_LOG_KEY` indicate the integration with the new JSON extraction feature from log lines.

69-73: Well-implemented extraction of optional log value from headers.
This code properly extracts an optional log format identifier from request headers, using the chain of `and_then` and `to_str().ok()` for safe conversion.

90-90: Header extraction passed to flattening function.
The extracted log value is now passed to the `flatten_and_push_logs` function, allowing for JSON extraction from log lines during ingestion.

153-153: Consistent use of `None` for OTEL endpoints.
You correctly pass `None` for the extract_log parameter in OTEL-specific handlers, indicating that no JSON extraction should occur for these specialized formats.
Also applies to: 191-191, 231-231

273-277: Consistent implementation of log extraction in post_event.
The same pattern for extracting optional log values from headers is implemented here, maintaining consistency across all ingestion endpoints.

285-285: Header extraction correctly passed to flattening function.
Similar to line 90, the extracted log value is passed to `flatten_and_push_logs` to enable JSON extraction during processing.

351-352: New error variant for schema validation.
A new `KnownFormat` error variant has been added to the `PostError` enum to handle cases where log format extraction fails, with automatic conversion from the `Unacceptable` error type.

379-379: Appropriate error status code for format validation failures.
The new error variant correctly returns a `BAD_REQUEST` status code, which is appropriate for client-provided data that doesn't match the expected format.
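A minimal sketch of how this error wiring typically looks with `thiserror` (the real `PostError` has many more variants, and the error message text and `status_code` helper shape below are assumptions for illustration):

```rust
use actix_web::http::StatusCode;
use thiserror::Error;

// Stand-in for the error type exported by the known_schema module (assumption).
#[derive(Debug, Error)]
#[error("Log event is not acceptable for known format {0}")]
pub struct Unacceptable(pub String);

#[derive(Debug, Error)]
pub enum PostError {
    // Automatic conversion from Unacceptable via #[from]
    #[error("{0}")]
    KnownFormat(#[from] Unacceptable),
    // ... the real enum has many more variants
}

impl PostError {
    // Map the new variant to 400 Bad Request, as the review notes.
    pub fn status_code(&self) -> StatusCode {
        match self {
            PostError::KnownFormat(_) => StatusCode::BAD_REQUEST,
        }
    }
}
```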
src/event/format/known_schema.rs (3)
27-32: Good implementation of global lazy-loaded schema definitions.
Using `Lazy` for the global instance ensures the JSON parsing only happens once at first access, which is efficient. The embedded JSON file through `include_str!` is a good approach for packaging the format definitions with the application.

34-36: Simple and focused error type.
The `Unacceptable` error type provides a clear message that includes the name of the format that failed to match, which will be helpful for debugging and user feedback.

60-102: Well-implemented field extraction with performance optimizations.
The `check_or_extract` method is well-designed with multiple efficiency considerations:
- Early return if fields already exist
- Graceful handling of missing log field
- Iterative pattern matching with early success return

This approach minimizes unnecessary regex processing which can be performance-intensive.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/event/format/known_schema.rs (1)
211-211: ⚠️ Potential issue: Avoid using `unreachable!()` for input validation.
Using `unreachable!()` for validating inputs is risky, as it will cause the application to panic if reached. Since this could be triggered by malformed client data, it should be handled more gracefully.

```diff
-            _ => unreachable!("We don't accept events of the form: {json}"),
+            _ => return Err(Unacceptable(format!("Unexpected JSON type, only objects and arrays are supported"))),
```
🧹 Nitpick comments (1)
src/handlers/http/ingest.rs (1)
289-297: Structured extraction for post_event handler.
The implementation mirrors the pattern in the `ingest` function but is structured differently. This section could potentially be refactored to reduce duplication and share code between the handlers.
Consider extracting the common log source handling logic into a separate function to reduce duplication between `ingest` and `post_event`:

```diff
-    match &log_source {
-        LogSource::OtelLogs | LogSource::OtelMetrics | LogSource::OtelTraces => {
-            return Err(PostError::OtelNotSupported)
-        }
-        LogSource::Custom(src) => {
-            KNOWN_SCHEMA_LIST.extract_from_inline_log(&mut json, src, extract_log)?;
-        }
-        _ => {}
-    }
+    handle_log_extraction(&mut json, &log_source, extract_log)?;

// Add this helper function elsewhere in the file:
// fn handle_log_extraction(json: &mut Value, log_source: &LogSource, extract_log: Option<&str>) -> Result<(), PostError> {
//     match log_source {
//         LogSource::OtelLogs | LogSource::OtelMetrics | LogSource::OtelTraces => {
//             return Err(PostError::OtelNotSupported)
//         }
//         LogSource::Custom(src) => {
//             KNOWN_SCHEMA_LIST.extract_from_inline_log(json, src, extract_log)?;
//         }
//         _ => {}
//     }
//     Ok(())
// }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- src/event/format/known_schema.rs (1 hunks)
- src/handlers/http/ingest.rs (7 hunks)
- src/handlers/http/modal/utils/ingest_utils.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/handlers/http/modal/utils/ingest_utils.rs
🧰 Additional context used
🧬 Code Definitions (1)
src/handlers/http/ingest.rs (3)
- src/event/format/mod.rs (2): `new` (106-111), `from` (70-80)
- src/event/format/known_schema.rs (1): `new` (147-167)
- src/metadata.rs (1): `new` (95-130)
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Kafka aarch64-apple-darwin
🔇 Additional comments (17)
src/handlers/http/ingest.rs (10)
31-31: New import adds JSON schema extraction functionality.
The import includes the `Unacceptable` error type and `KNOWN_SCHEMA_LIST` for validating and extracting structured data from logs according to predefined schemas.

33-33: New `EXTRACT_LOG_KEY` constant integration.
This import refers to a constant that specifies the HTTP header key used to identify the field containing the raw log data for extraction.

52-52: Function signature modification adds mutable JSON handling.
The parameter is now marked as `mut json` to allow modification of the JSON during log extraction.

69-73: New header extraction logic for log field identification.
This code extracts an optional field name from request headers that identifies which field in the JSON contains the raw log text to be parsed.

81-90: Enhanced log source handling with structured extraction.
The code now handles different log sources, particularly focusing on extracting structured data from custom logs using regex patterns. For OTel formats, it properly returns an error, while for custom sources it attempts to extract fields based on the defined schema.

91-91: Improved log source metadata with extracted fields.
This creates a `LogSourceEntry` that includes both the log source type and the set of fields that were successfully extracted, which enhances the metadata for the stream.

253-253: Mutable JSON parameter for in-place field extraction.
The JSON parameter is now mutable to allow the extraction process to modify it by adding structured fields parsed from log lines.

284-288: Consistent header extraction logic across handlers.
This implements the same header extraction logic as in the `ingest` function, ensuring consistent behavior when extracting the log field name.

365-366: New error variant for schema validation failures.
A new `KnownFormat` error type is added to handle cases where the log doesn't match the expected format according to the predefined schema.

393-393: Proper HTTP status code for format validation errors.
Returns a `BAD_REQUEST` status code when a log doesn't match the expected format, which is appropriate for client-side errors related to input validation.

src/event/format/known_schema.rs (7)
27-32: Global schema definitions from embedded JSON resource.
The code loads predefined log format definitions from an embedded JSON file and initializes a global singleton using the `Lazy` pattern, which ensures thread-safe initialization on first access.

34-36: Clear error type for format validation failures.
The `Unacceptable` error type provides meaningful error messages when logs don't match the expected format, improving diagnostics.

40-51: Resilient regex deserializer ignores invalid patterns.
The deserializer handles malformed regex patterns gracefully by emitting warnings rather than failing completely. This prevents a single bad pattern from breaking the entire schema.
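A sketch of what such a lenient deserializer can look like; the exact signature in `known_schema.rs` is an assumption, while the log-and-return-`None` behavior and the error message follow the excerpts quoted in this review:

```rust
use regex::Regex;
use serde::{Deserialize, Deserializer};
use tracing::error;

// Deserialize a pattern string into Option<Regex>, logging (not failing) on bad patterns.
fn deserialize_regex<'de, D>(deserializer: D) -> Result<Option<Regex>, D::Error>
where
    D: Deserializer<'de>,
{
    let pattern = String::deserialize(deserializer)?;
    Ok(Regex::new(&pattern)
        .inspect_err(|err| error!("Error compiling regex pattern: {err}; Pattern: {pattern}"))
        .ok())
}
```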
64-67: Schema definition structure for pattern collections.
The `SchemaDefinition` struct provides a clean abstraction for managing multiple regex patterns for a single log format.

83-129: Efficient field extraction with preemptive checking.
The `check_or_extract` method first checks if all required fields are already present before attempting extraction, which optimizes performance by avoiding unnecessary regex operations.

179-215: Comprehensive JSON extraction with multi-format support.
The `extract_from_inline_log` method handles both single JSON objects and arrays of objects, providing flexibility for different log formats.

218-497: Comprehensive test suite for extraction functionality.
The tests cover various scenarios including successful extraction, handling of pre-existing fields, failed matches, and error cases, which ensures the robustness of the implementation.
Actionable comments posted: 0
♻️ Duplicate comments (2)
src/event/format/known_schema.rs (2)
147-154: 🛠️ Refactor suggestion: Potential panic during initialization
Using `expect` when parsing the JSON schema definitions could cause a panic during application startup if the JSON is malformed.
Consider handling JSON parsing errors more gracefully:

```diff
-        let formats: Vec<Format> =
-            serde_json::from_str(json_text).expect("Known formats are stored as JSON text");
+        let formats: Vec<Format> = match serde_json::from_str(json_text) {
+            Ok(formats) => formats,
+            Err(e) => {
+                error!("Failed to parse format definitions: {}", e);
+                Vec::new() // Return empty formats to allow application to start with reduced functionality
+            }
+        };
```
211-211: ⚠️ Potential issue: Avoid using `unreachable!()` for input validation
Using `unreachable!()` for handling non-object/array JSON values is risky, as it will cause the application to panic if reached, which could be triggered by malformed client data.

```diff
-            _ => unreachable!("We don't accept events of the form: {json}"),
+            _ => return Err(Unacceptable(format!("Unexpected JSON type, only objects and arrays are supported"))),
```
🧹 Nitpick comments (1)
src/event/format/known_schema.rs (1)
179-215: Consider more granular error handling for array processing
Currently, if any event in an array fails extraction, the entire operation fails. Consider a more flexible approach that could skip or separately report problematic entries.

```diff
 pub fn extract_from_inline_log(
     &self,
     json: &mut Value,
     log_source: &str,
     extract_log: Option<&str>,
-) -> Result<HashSet<String>, Unacceptable> {
+) -> Result<(HashSet<String>, Vec<usize>), Unacceptable> {
     let Some(schema) = self.schema_definitions.get(log_source) else {
         warn!("Unknown log format: {log_source}");
-        return Ok(Default::default());
+        return Ok((Default::default(), vec![]));
     };

     let mut fields = HashSet::new();
+    let mut failed_indices = Vec::new();
     match json {
         Value::Array(list) => {
-            for event in list {
+            for (idx, event) in list.iter_mut().enumerate() {
                 let Value::Object(event) = event else {
+                    failed_indices.push(idx);
                     continue;
                 };

                 if let Some(known_fields) = schema.check_or_extract(event, extract_log) {
                     fields.extend(known_fields);
                 } else {
-                    return Err(Unacceptable(log_source.to_owned()));
+                    failed_indices.push(idx);
                 }
             }
+            if !failed_indices.is_empty() && failed_indices.len() == list.len() {
+                return Err(Unacceptable(format!("All events failed extraction for format {}", log_source)));
+            }
         }
         Value::Object(event) => {
             if let Some(known_fields) = schema.check_or_extract(event, extract_log) {
-                return Ok(known_fields);
+                return Ok((known_fields, vec![]));
             } else {
                 return Err(Unacceptable(log_source.to_owned()));
             }
         }
-        _ => unreachable!("We don't accept events of the form: {json}"),
+        _ => return Err(Unacceptable(format!("Unexpected JSON type, only objects and arrays are supported"))),
     }

-    Ok(fields)
+    Ok((fields, failed_indices))
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- src/event/format/known_schema.rs (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: coverage
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
🔇 Additional comments (5)
src/event/format/known_schema.rs (5)
40-51: Good approach for handling invalid regex patterns
The function logs any regex compilation errors but continues execution by returning `None`. This gracefully handles malformed patterns without disrupting the entire application.

83-128: Robust implementation of field extraction logic
The `check_or_extract` method follows a clear sequential approach: first checking if fields already exist, then attempting extraction if needed. The error handling is thorough with proper early returns.

185-188: Good error handling for unknown log formats
The code gracefully handles unknown log formats by logging a warning and returning an empty set of fields rather than failing the operation. This allows the application to continue functioning even when log formats aren't recognized.

25-25: Good use of tracing crate for error logging
Using the `tracing` crate for logging allows for structured logs with metadata, which is better than simple `println!` or `eprintln!` statements.

218-497: Comprehensive test coverage
The tests cover a wide range of scenarios including successful extractions, handling of pre-existing fields, error cases, and array processing. This thorough testing increases confidence in the code's reliability.
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/event/format/known_schema.rs (2)
185-188: Consider more specific error handling for unknown formats.
When an unknown log format is requested, the code currently returns an empty `HashSet` after logging a warning. Consider whether returning an error would be more appropriate to provide clearer feedback to callers.

```diff
-        let Some(schema) = self.schema_definitions.get(log_source) else {
-            warn!("Unknown log format: {log_source}");
-            return Ok(HashSet::new());
-        };
+        let Some(schema) = self.schema_definitions.get(log_source) else {
+            warn!("Unknown log format: {log_source}");
+            return Err(Unacceptable(format!("Unknown log format: {}", log_source)));
+        };
```

197-202: Reconsider all-or-nothing approach for array items.
If one item in a JSON array fails extraction, the entire operation fails. Consider whether it might be better to extract what's possible and report partial success, especially for batch processing scenarios.
You could collect successful extractions and failures separately, then decide how to handle them based on application requirements.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- src/event/format/known_schema.rs (1 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
src/event/format/known_schema.rs (1)
- src/event/format/mod.rs (1): `new` (106-111)
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: coverage
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
🔇 Additional comments (7)
src/event/format/known_schema.rs (7)
40-51: Good approach for handling regex compilation errors.
The `deserialize_regex` function handles regex compilation errors gracefully by logging them and continuing with `None` instead of causing a panic. This ensures that malformed regex patterns won't crash the application during startup.

147-154: Potential panic during initialization.
The `new` method uses `expect` when parsing the JSON schema definitions, which could cause a panic during application startup if the JSON is malformed.
Consider handling JSON parsing errors more gracefully:

```diff
-        let formats: Vec<Format> =
-            serde_json::from_str(json_text).expect("Known formats are stored as JSON text");
+        let formats: Vec<Format> = match serde_json::from_str(json_text) {
+            Ok(formats) => formats,
+            Err(e) => {
+                error!("Failed to parse format definitions: {}", e);
+                Vec::new() // Return empty formats to allow application to start with reduced functionality
+            }
+        };
```

83-101: Optimized field presence check.
The method efficiently checks if fields are already present in the JSON object before attempting regex extraction. This optimization avoids unnecessary regex processing when data is already structured correctly.

103-125: Well-structured extraction logic.
The extraction logic is well-implemented - it tries each pattern sequentially, extracts fields with named capture groups, and extends the original object with the extracted values. The early return pattern is used effectively to stop processing once a matching pattern is found.
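A self-contained illustration of that capture-group-to-JSON step (the pattern, field names, and sample line below are invented for the example; only the overall technique mirrors the reviewed code):

```rust
use regex::Regex;
use serde_json::{Map, Value};

fn main() {
    let pattern = Regex::new(r"^(?P<level>\w+): (?P<body>.*)$").unwrap();
    let fields = ["level", "body"];
    let event = "ERROR: disk is almost full";

    let mut obj = Map::new();
    if let Some(captures) = pattern.captures(event) {
        for field in fields {
            if let Some(value) = captures.name(field) {
                obj.insert(field.to_owned(), Value::String(value.as_str().to_owned()));
            }
        }
    }

    // Prints: {"body":"disk is almost full","level":"ERROR"}
    println!("{}", Value::Object(obj));
}
```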
179-215: Handle all JSON value types properly.
The `extract_from_inline_log` method has good handling for JSON objects and arrays, but uses `unreachable!()` for other JSON types, which could lead to application panics if unexpected input is received.
Replace the use of `unreachable!()` with a proper error return:

```diff
-            _ => unreachable!("We don't accept events of the form: {json}"),
+            _ => return Err(Unacceptable(format!("Unexpected JSON type, only objects and arrays are supported"))),
```

218-497: Comprehensive test coverage.
The test suite is thorough and covers various scenarios:
- Basic extraction functionality
- Pre-existing fields detection
- Error handling for non-matching logs
- Missing fields and patterns
- Object and array processing
- Edge cases like unknown formats and missing log fields

103-109: Robust handling of missing patterns.
The code gracefully continues to the next pattern when encountering patterns that failed to compile (`None` value) or don't match the input. This ensures resilience against individual pattern failures.
Failure condition when expected fields not present
Success otherwise
Also uses regex to capture from text:
Actionable comments posted: 1
♻️ Duplicate comments (2)
src/event/format/known_schema.rs (2)
148-168: 🛠️ Refactor suggestion: Handle JSON parsing errors gracefully in EventProcessor::new.
As noted earlier, the use of `expect` could cause a panic if the JSON is malformed. This method should be refactored to handle errors gracefully.

180-216: ⚠️ Potential issue: Avoid unreachable! in extract_from_inline_log.
Using `unreachable!()` for input validation at line 211 is risky, as it will panic if reached. Since this could be triggered by malformed client data, it should be handled more gracefully.

```diff
-            _ => unreachable!("We don't accept events of the form: {json}"),
+            _ => return Err(Error::Unacceptable(format!("Unexpected JSON type, only objects and arrays are supported"))),
```
🧹 Nitpick comments (4)
src/handlers/http/ingest.rs (2)
84-94: Simplify the duplicate OTEL log source check.
There are two separate checks for OTEL log sources - one at lines 77-82 and another at lines 84-87. This is redundant and could be simplified.

```diff
-    let fields = match &log_source {
-        LogSource::OtelLogs | LogSource::OtelMetrics | LogSource::OtelTraces => {
-            return Err(PostError::OtelNotSupported)
-        }
-        LogSource::Custom(src) => {
-            KNOWN_SCHEMA_LIST.extract_from_inline_log(&mut json, src, extract_log)?
-        }
-        _ => HashSet::new(),
-    };
+    let fields = if let LogSource::Custom(src) = &log_source {
+        KNOWN_SCHEMA_LIST.extract_from_inline_log(&mut json, src, extract_log)?
+    } else {
+        HashSet::new()
+    };
```

292-300: Consider returning extracted fields for all log types.
The current implementation ignores the return value from `extract_from_inline_log` for Custom log sources, and doesn't attempt extraction for other non-OTEL sources. Consider using the extracted field information for all log types to improve consistency.

```diff
     match &log_source {
         LogSource::OtelLogs | LogSource::OtelMetrics | LogSource::OtelTraces => {
             return Err(PostError::OtelNotSupported)
         }
         LogSource::Custom(src) => {
-            KNOWN_SCHEMA_LIST.extract_from_inline_log(&mut json, src, extract_log)?;
+            let fields = KNOWN_SCHEMA_LIST.extract_from_inline_log(&mut json, src, extract_log)?;
+            // Potentially use the extracted fields here
         }
         _ => {}
     }
```

src/event/format/known_schema.rs (2)
1-17: Copyright date should be 2024, not 2025.
The copyright header includes the year 2025, but the current year is 2024. This should be corrected.

```diff
-/* Parseable Server (C) 2022 - 2025 Parseable, Inc.
+/* Parseable Server (C) 2022 - 2024 Parseable, Inc.
```
87-129: Improve the check_or_extract method's error reporting.
The method returns `None` on extraction failure but doesn't provide context on why extraction failed. It also returns `Some(pattern.fields.clone())` when fields are already present, which might not be intuitive.
Consider refining the return type to provide more context:

```diff
-    pub fn check_or_extract(
-        &self,
-        obj: &mut Map<String, Value>,
-        extract_log: Option<&str>,
-    ) -> Option<HashSet<String>> {
+    pub fn check_or_extract(
+        &self,
+        obj: &mut Map<String, Value>,
+        extract_log: Option<&str>,
+    ) -> Result<HashSet<String>, &'static str> {
         if let Some(pattern) = self
             .patterns
             .iter()
             .find(|pattern| pattern.fields.iter().all(|field| obj.contains_key(field)))
         {
-            return Some(pattern.fields.clone());
+            return Ok(pattern.fields.clone());
         }

         let event = extract_log
             .and_then(|field| obj.get(field))
-            .and_then(|s| s.as_str())?;
+            .and_then(|s| s.as_str())
+            .ok_or("Log field not found or not a string")?;

         for format in self.patterns.iter() {
             let Some(pattern) = format.pattern.as_ref() else {
                 continue;
             };
             let Some(captures) = pattern.captures(event) else {
                 continue;
             };
             let mut extracted_fields = Map::new();
             // With named capture groups, you can iterate over the field names
             for field_name in format.fields.iter() {
                 if let Some(value) = captures.name(field_name) {
                     extracted_fields.insert(
                         field_name.to_owned(),
                         Value::String(value.as_str().to_string()),
                     );
                 }
             }

             obj.extend(extracted_fields);
-            return Some(format.fields.clone());
+            return Ok(format.fields.clone());
         }

-        None
+        Err("No matching pattern found")
     }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/event/format/known_schema.rs (1 hunks)
- src/handlers/http/ingest.rs (7 hunks)
🔇 Additional comments (9)
src/handlers/http/ingest.rs (7)
31-31: Good change: Including the known_schema module.
Adding the import for the known_schema module allows the ingest handlers to use the new functionality for detecting and extracting structured data from log lines.

33-33: Adding the EXTRACT_LOG_KEY import for header extraction.
This change adds the capability to extract log format information from request headers, which is needed for the new feature.

72-76: Extracting log format from headers.
This code properly extracts an optional log format identifier from request headers, which will be used to determine how to process the logs.

256-256: Taking mutable JSON input to support field extraction.
The `mut` modifier for JSON in the function signature is necessary to allow the new field extraction logic to modify the input JSON object.

287-291: Extracting log format from headers in post_event.
This code extracts the optional log format identifier from request headers, maintaining consistency with the changes in the `ingest` function.

368-369: Adding KnownFormat error variant.
Good addition of a new error variant to handle errors from the known_schema module, allowing proper error propagation and handling.

396-396: Properly mapping KnownFormat errors to BAD_REQUEST.
Setting the appropriate status code for the new error type ensures that clients receive clear feedback when there are issues with the log format.
src/event/format/known_schema.rs (2)
44-55: Good approach for regex deserialization.
The custom deserialization logic correctly handles invalid regex patterns by logging errors rather than causing failures. This makes the application more robust against configuration issues.

218-497: Comprehensive test coverage.
The test suite is thorough and covers many important scenarios:
- Successful extraction from different log formats
- Handling already existing fields
- Non-matching logs
- Missing log fields
- Array and object processing
This comprehensive testing will help ensure the reliability of the feature.
Branch force-pushed from 7ff8ed2 to b40779c.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- Dockerfile.debug (1 hunks)
- Dockerfile.kafka (1 hunks)
- resources/formats.json (1 hunks)
- src/event/format/known_schema.rs (1 hunks)
- src/event/format/mod.rs (2 hunks)
- src/handlers/http/ingest.rs (7 hunks)
- src/handlers/http/logstream.rs (4 hunks)
- src/handlers/http/modal/utils/ingest_utils.rs (1 hunks)
- src/handlers/mod.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
- src/handlers/http/modal/utils/ingest_utils.rs
- src/handlers/mod.rs
- Dockerfile.debug
- src/handlers/http/logstream.rs
- Dockerfile.kafka
- resources/formats.json
- src/event/format/known_schema.rs
🧰 Additional context used
🧬 Code Definitions (1)
src/handlers/http/ingest.rs (3)
- src/event/format/known_schema.rs (1): `new` (148-168)
- src/event/format/mod.rs (2): `new` (118-123), `from` (82-92)
- src/metadata.rs (1): `new` (95-130)
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: coverage
🔇 Additional comments (9)
src/event/format/mod.rs (2)
42-42: New module addition
Adding the `known_schema` module is a good approach for handling structured log formats based on regex patterns. This aligns with the PR objective of detecting and extracting JSON from log lines.

89-90: Improved LogSource enum handling
This change enhances the readability and correctness of the `LogSource` enum conversion by:
- Explicitly handling empty strings (`""`) as `LogSource::Json`
- Creating a dedicated path for custom formats

This is a necessary improvement to support the new custom log format extraction feature.
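A hedged reconstruction of what the resulting conversion likely looks like; the recognized strings are taken from the verification notes further down in this review, while the enum definition and variant names (in particular `Pmeta`) are assumptions for illustration:

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum LogSource {
    Kinesis,
    OtelLogs,
    OtelMetrics,
    OtelTraces,
    Pmeta,
    Json,
    // Custom log sources, e.g. "syslog"
    Custom(String),
}

impl From<&str> for LogSource {
    fn from(s: &str) -> Self {
        match s {
            "kinesis" => LogSource::Kinesis,
            "otel-logs" => LogSource::OtelLogs,
            "otel-metrics" => LogSource::OtelMetrics,
            "otel-traces" => LogSource::OtelTraces,
            "pmeta" => LogSource::Pmeta,
            // Empty string and "json" both map to the JSON default
            "" | "json" => LogSource::Json,
            // Anything unrecognized becomes a custom format
            custom => LogSource::Custom(custom.to_owned()),
        }
    }
}
```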
src/handlers/http/ingest.rs (7)
31-31: New import for known schema handling
Importing the `known_schema` module and its `KNOWN_SCHEMA_LIST`, which contains predefined schema definitions. This is essential for supporting the feature to detect and extract JSON from log-lines.

33-33: Added extract log key import
The `EXTRACT_LOG_KEY` constant is now imported to extract the log field identifier from HTTP headers, which will be used to specify which part of the log entry should be parsed for JSON extraction.

72-75: Extract log field from header
Good implementation of extracting an optional log field identifier from the request headers. This parameter allows clients to specify which field in the log entry contains JSON data that should be extracted.

84-95: Added JSON extraction for custom log formats
This implementation properly handles different log source types:
- Returns an appropriate error for unsupported OpenTelemetry formats
- Extracts JSON data from custom log formats using the schema processor
- Creates a log source entry with extracted fields
The extracted fields set will later be stored with the stream metadata, enhancing query capabilities.
369-369: Added known format error variant
The addition of the `KnownFormat` error variant to the `PostError` enum properly integrates errors from the `known_schema` module into the HTTP error handling system.

396-396: Error handling for known format errors
The error handling logic correctly maps `KnownFormat` errors to a `BAD_REQUEST` status code, which is appropriate for client-side errors related to invalid log format specifications.
52-95: ❓ Verification inconclusive
Verify the end-to-end functionality
The changes implement a feature for detecting and extracting JSON from log lines, but it would be beneficial to verify how it works end-to-end.

🏁 Script executed:

```bash
#!/bin/bash
# Check for the implementation of the EXTRACT_LOG_KEY constant
echo "Checking for EXTRACT_LOG_KEY constant definition:"
rg "EXTRACT_LOG_KEY.*=" --glob "src/handlers/mod.rs"

# Look at how EventProcessor is initialized with format definitions
echo -e "\nChecking EventProcessor initialization:"
rg "KNOWN_SCHEMA_LIST.*=.*Lazy" -A 10 --glob "src/event/format/known_schema.rs"

# Check if there's a formats.json file containing regex patterns
echo -e "\nChecking for formats.json file:"
fd "formats.json"

# If found, examine its contents to understand the supported log formats
echo -e "\nExamining format definitions:"
find . -name "formats.json" -exec cat {} \; | head -n 25
```

Length of output: 2901
Action required: Verify complete end-to-end log ingestion and extraction functionality
The feature for detecting and extracting JSON from inline log lines appears to be integrated as follows:
- The `EXTRACT_LOG_KEY` constant is defined in `src/handlers/mod.rs` as `"x-p-extract-log"`.
- The inline log extraction via `KNOWN_SCHEMA_LIST.extract_from_inline_log` is used in the custom log source branch.
- The format definitions in `resources/formats.json` are intact and appear to include the necessary regex patterns.

Please ensure that:
- End-to-end tests cover requests with valid inline JSON logs, validating that extraction works as expected.
- The error handling for internal stream names and unsupported log sources (e.g., Otel sources) is working correctly.
- The integration between log ingestion, extraction, and downstream event processing (e.g., via EventProcessor) is fully validated.
@de-sh I see stream-info response as below -
there is a difference in the structure of
Actionable comments posted: 2
♻️ Duplicate comments (1)
src/event/format/known_schema.rs (1)
153-154: 🛠️ Refactor suggestion: Potential panic during initialization
The `new` method uses `expect` when parsing the JSON schema definitions, which could cause a panic during application startup if the JSON is malformed.
Consider handling JSON parsing errors more gracefully:

```diff
-        let formats: Vec<Format> =
-            serde_json::from_str(json_text).expect("Known formats are stored as JSON text");
+        let formats: Vec<Format> = match serde_json::from_str(json_text) {
+            Ok(formats) => formats,
+            Err(e) => {
+                error!("Failed to parse format definitions: {}", e);
+                Vec::new() // Return empty formats to allow application to start with reduced functionality
+            }
+        };
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- Dockerfile.debug (1 hunks)
- Dockerfile.kafka (1 hunks)
- resources/formats.json (1 hunks)
- src/event/format/known_schema.rs (1 hunks)
- src/event/format/mod.rs (3 hunks)
- src/handlers/http/ingest.rs (7 hunks)
- src/handlers/http/logstream.rs (4 hunks)
- src/handlers/http/modal/utils/ingest_utils.rs (1 hunks)
- src/handlers/mod.rs (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- src/handlers/mod.rs
🚧 Files skipped from review as they are similar to previous changes (5)
- Dockerfile.kafka
- src/handlers/http/modal/utils/ingest_utils.rs
- Dockerfile.debug
- src/handlers/http/logstream.rs
- src/handlers/http/ingest.rs
🧰 Additional context used
🧬 Code Definitions (1)
src/event/format/known_schema.rs (1)
- src/event/format/mod.rs (1): `new` (120-125)
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: coverage
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
🔇 Additional comments (7)
resources/formats.json (1)
1-554: ❓ Verification inconclusive
Extensive collection of log format regex patterns added
This file provides a comprehensive set of regex patterns for parsing different log formats, which will greatly enhance Parseable's ability to extract structured data from various logging systems.
A few observations:
- Each format includes a `body` field to capture remaining content
- Multiple regex patterns per format handle variations of the same log type
- The patterns are detailed and account for many edge cases

Since this file is critical for log parsing functionality, let's verify that the regex patterns compile correctly:

🏁 Script executed:

```bash
#!/bin/bash
# Check if all regex patterns compile correctly
cd resources
jq -r '.[] | .regex[] | .pattern' formats.json | while read -r pattern; do
  if ! echo "" | grep -P "$pattern" &>/dev/null; then
    echo "Invalid regex pattern: $pattern"
  fi
done
```

Length of output: 64023
Action Required: Verify Regex Compatibility with Your Parser Engine
The shell script output reveals that nearly every regex pattern in the file fails to compile using our `grep -P` check. However, these errors are likely due to known limitations of the `grep -P` implementation, especially its handling (or lack thereof) of the named capture group syntax (e.g. `(?<name>…)` and `(?P<name>…)`). If your log parser uses a regex engine that fully supports these constructs (such as the Python `re` module or a fully featured PCRE library), these errors may not affect production functionality.

Points to confirm:
- Ensure that the regex engine used in your production log parser fully supports the syntax used in these patterns.
- Consider adding tests in your actual parser environment to verify pattern matching and data extraction.
- If your production engine cannot support these constructs, you may need to refactor the regexes to use a compatible syntax.
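One quick way to act on the points above, assuming the production parser uses Rust's `regex` crate (which, in recent versions, accepts both named-group spellings mentioned here), is to compile a sample pattern with that crate directly rather than with `grep -P`:

```rust
use regex::Regex;

fn main() {
    // Both named-group spellings discussed above (sample patterns only, not from formats.json)
    for pattern in [r"(?P<body>.*)", r"(?<body>.*)"] {
        match Regex::new(pattern) {
            Ok(_) => println!("compiles with the regex crate: {pattern}"),
            Err(err) => println!("rejected: {pattern}: {err}"),
        }
    }
}
```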
src/event/format/mod.rs (3)
42-42: New module added for known schema handling
Adding this module enables the application to work with predefined log formats.

78-80: Custom log format support added to LogSource enum
The `Custom(String)` variant with the `#[serde(untagged)]` attribute enables deserialization of user-defined log formats, allowing for more flexible log source handling.

91-92: Modified LogSource string conversion
The implementation now explicitly handles empty strings as JSON format and properly creates custom format types for other inputs. This is more explicit and clear compared to the previous catch-all approach.
src/event/format/known_schema.rs (3)
44-55: Good handling of regex compilation errors
The `deserialize_regex` function appropriately logs compilation errors but allows the application to continue running. This approach is resilient as it won't break the entire application if a single regex pattern is invalid.

87-129: Well-designed field extraction with fallback mechanism
The `check_or_extract` method intelligently:
- First checks if all fields are already present
- Only attempts extraction if needed
- Tries multiple patterns until one matches

This approach minimizes regex processing overhead and handles partial matches gracefully.
218-497: Comprehensive test coverage
The test module is thorough, covering:
- Regular expression pattern matching
- Field extraction logic
- Handling of already present fields
- Error cases like no match, missing fields
- Processing of both single objects and arrays
This level of testing increases confidence in the implementation.
```rust
const FORMATS_JSON: &str = include_str!("../../../resources/formats.json");

/// Global instance of EventProcessor containing predefined schema definitions
pub static KNOWN_SCHEMA_LIST: Lazy<EventProcessor> =
    Lazy::new(|| EventProcessor::new(FORMATS_JSON));
```
Consider error handling during initialization
The use of a `Lazy` static ensures thread-safe initialization, which is good. However, the `EventProcessor::new` method uses `expect`, which could cause a panic if the JSON is malformed, potentially bringing down the entire application during startup.
Consider implementing more graceful error handling during initialization:
```diff
 pub static KNOWN_SCHEMA_LIST: Lazy<EventProcessor> =
-    Lazy::new(|| EventProcessor::new(FORMATS_JSON));
+    Lazy::new(|| {
+        match EventProcessor::try_new(FORMATS_JSON) {
+            Ok(processor) => processor,
+            Err(e) => {
+                error!("Failed to parse format definitions: {}", e);
+                EventProcessor { schema_definitions: HashMap::new() }
+            }
+        }
+    });
```
Then update the `EventProcessor` implementation:
```rust
impl EventProcessor {
    fn try_new(json_text: &str) -> Result<Self, serde_json::Error> {
        let mut processor = EventProcessor {
            schema_definitions: HashMap::new(),
        };

        let formats: Vec<Format> = serde_json::from_str(json_text)?;

        for format in formats {
            for regex in format.regex {
                let schema = processor
                    .schema_definitions
                    .entry(format.name.clone())
                    .or_default();

                schema.patterns.push(regex);
            }
        }

        Ok(processor)
    }
}
```
```rust
    pub fn extract_from_inline_log(
        &self,
        json: &mut Value,
        log_source: &str,
        extract_log: Option<&str>,
    ) -> Result<HashSet<String>, Error> {
        let Some(schema) = self.schema_definitions.get(log_source) else {
            return Err(Error::Unknown(log_source.to_owned()));
        };

        let mut fields = HashSet::new();
        match json {
            Value::Array(list) => {
                for event in list {
                    let Value::Object(event) = event else {
                        continue;
                    };
                    if let Some(known_fields) = schema.check_or_extract(event, extract_log) {
                        fields.extend(known_fields);
                    } else {
                        return Err(Error::Unacceptable(log_source.to_owned()));
                    }
                }
            }
            Value::Object(event) => {
                if let Some(known_fields) = schema.check_or_extract(event, extract_log) {
                    return Ok(known_fields);
                } else {
                    return Err(Error::Unacceptable(log_source.to_owned()));
                }
            }
            _ => unreachable!("We don't accept events of the form: {json}"),
        }

        Ok(fields)
    }
}
```
Robust method for extracting log fields from JSON structures
The `extract_from_inline_log` method handles both JSON objects and arrays, applies appropriate schema definitions, and collects field names from successful extractions.
However, there's one issue: the `unreachable!` macro on line 211 will cause a panic if reached. Since this function processes user input, it should handle unexpected inputs gracefully.

```diff
-            _ => unreachable!("We don't accept events of the form: {json}"),
+            _ => return Err(Error::Unacceptable(format!("Unexpected JSON format: {}", log_source))),
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
    pub fn extract_from_inline_log(
        &self,
        json: &mut Value,
        log_source: &str,
        extract_log: Option<&str>,
    ) -> Result<HashSet<String>, Error> {
        let Some(schema) = self.schema_definitions.get(log_source) else {
            return Err(Error::Unknown(log_source.to_owned()));
        };

        let mut fields = HashSet::new();
        match json {
            Value::Array(list) => {
                for event in list {
                    let Value::Object(event) = event else {
                        continue;
                    };
                    if let Some(known_fields) = schema.check_or_extract(event, extract_log) {
                        fields.extend(known_fields);
                    } else {
                        return Err(Error::Unacceptable(log_source.to_owned()));
                    }
                }
            }
            Value::Object(event) => {
                if let Some(known_fields) = schema.check_or_extract(event, extract_log) {
                    return Ok(known_fields);
                } else {
                    return Err(Error::Unacceptable(log_source.to_owned()));
                }
            }
            _ => return Err(Error::Unacceptable(format!("Unexpected JSON format: {}", log_source))),
        }

        Ok(fields)
    }
```
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- src/event/format/mod.rs (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: coverage
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-apple-darwin
🔇 Additional comments (3)
src/event/format/mod.rs (3)
42-42: LGTM on adding known_schema module
This new module aligns with the PR's objective to detect and extract JSON from log lines, providing a structured approach to handling known log schemas.

91-92: Improvement in specificity
The code now explicitly handles empty strings and "json" separately from other custom formats, which improves clarity. This explicit handling aligns with the PR's objective of distinguishing between different log formats.
85-93
:❓ Verification inconclusive
Consider backwards compatibility implications

Changing the default behavior from treating all unmatched strings as LogSource::Json to treating them as LogSource::Custom(string) might impact existing code. This is a behavioral change that could affect systems that rely on the previous default.

🏁 Script executed:

#!/bin/bash
# Find places where LogSource is used to understand potential impacts
rg "LogSource::" --type rust -A 3 -B 3

# Check if there are any tests that might be relying on this behavior
rg "LogSource::from" --type rust -A 5

Length of output: 16163
Action Required: Verify Backwards Compatibility for LogSource Default Behavior

The change in the From<&str> implementation now produces LogSource::Custom(custom.to_owned()) for any string that isn't one of the recognized inputs ("kinesis", "otel-logs", "otel-metrics", "otel-traces", "pmeta", or the empty string/"json"). Previously, any unrecognized string might have defaulted to LogSource::Json. Although tests (e.g., in src/handlers/http/logstream.rs and elsewhere) confirm that expected behavior has shifted, using LogSource::Custom for inputs like "teststream" while preserving Json for the empty or "json" cases, it's important to ensure that:

- External integrations or client code aren't inadvertently relying on the old behavior (i.e., defaulting unrecognized strings to Json).
- Documentation and/or migration guides clearly communicate this behavioral change so that systems expecting a Json default for unrecognized inputs can be reviewed and, if necessary, updated.
- All integration tests and module usages (as seen in files like src/handlers/http/ingest.rs and src/utils/json/mod.rs) have been verified against this change to avoid unexpected runtime issues.

Please double-check the external dependencies and any integrations that assume an unrecognized log source will resolve to Json rather than a custom variant.
    // Custom Log Sources e.g. "syslog"
    #[serde(untagged)]
    Custom(String),
💡 Verification agent
🧩 Analysis chain
Check placement of serde attribute

The #[serde(untagged)] attribute is typically used on entire enums, not on specific variants. It instructs serde to try all possible deserialization paths. Placing it on a single variant might not have the intended effect.
🏁 Script executed:
#!/bin/bash
# Check if the untagged attribute is used similarly in other parts of the codebase
rg "#\[serde\(untagged\)\]" -A 2 -B 2
Length of output: 739
Action Required: Move #[serde(untagged)] from variant to enum level

The attribute is currently applied directly on the Custom(String) variant in src/event/format/mod.rs (lines 78–80), but that pattern isn't effective. In other parts of the codebase (e.g., in src/rbac/user.rs and src/alerts/target.rs), the same attribute is applied to the entire enum. Please update the code so that the #[serde(untagged)] attribute decorates the enum instead of an individual variant to ensure the proper deserialization behavior.
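For reference, a minimal sketch of the variant-level placement under discussion; the enum is trimmed to a few illustrative variants and is not the project's real definition. Note that serde has accepted #[serde(untagged)] on trailing variants of an otherwise externally tagged enum since 1.0.174, so whether the variant-level form misbehaves depends on the serde version the project pins; that is worth verifying before moving the attribute.

// Sketch only: a trimmed enum showing variant-level untagged on the last variant,
// which requires serde 1.0.174 or newer.
use serde::{Deserialize, Serialize};

#[derive(Debug, PartialEq, Serialize, Deserialize)]
#[serde(rename_all = "kebab-case")]
enum LogSource {
    Kinesis,
    OtelLogs,
    Json,
    // Unknown strings fall through to this variant; named variants above still
    // match by their renamed tags.
    #[serde(untagged)]
    Custom(String),
}

fn main() {
    let known: LogSource = serde_json::from_str("\"otel-logs\"").unwrap();
    assert_eq!(known, LogSource::OtelLogs);

    let custom: LogSource = serde_json::from_str("\"syslog\"").unwrap();
    assert_eq!(custom, LogSource::Custom("syslog".to_string()));

    // Serialization keeps the tag for named variants and the raw string for Custom.
    assert_eq!(serde_json::to_string(&custom).unwrap(), "\"syslog\"");
}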
Actionable comments posted: 0
🧹 Nitpick comments (1)
resources/formats.json (1)
1-600: Well-structured collection of log format definitions with comprehensive regex patterns.

This new file provides a robust set of regex patterns for parsing various log formats. The structure is consistent and well-organized, with each format having a name and associated regex patterns with named capture groups.
While the implementation appears solid, consider the following suggestions:
- Some regex patterns are quite complex (e.g., line 27 for ALB logs), which might impact performance during log parsing.
- Consider adding version information or a last updated timestamp to the file to track changes over time.
- Adding a short comment at the beginning of the file explaining its purpose and how it's used would improve maintainability.
Would you like me to suggest some optimizations for the more complex regex patterns or add documentation to explain the file's purpose?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
resources/formats.json (1 hunks)
src/migration/stream_metadata_migration.rs (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default aarch64-apple-darwin
🔇 Additional comments (1)
src/migration/stream_metadata_migration.rs (1)
259-259: Update to log source structure in expected test output.

This change reflects the migration from a simple string representation of log source to a structured format with log_source_format and fields properties. This aligns with the new format definitions introduced in the formats.json file.

The test adjustment ensures that the v5_v6 migration properly handles unknown log source types by converting them to the new structured format.
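For illustration, a rough sketch of what such a structured entry could look like on the wire; the struct and field names below are assumptions derived from the two property names mentioned above, not the project's actual types.

// Hypothetical shape: only the property names log_source_format and fields are
// taken from the review text above, everything else is illustrative.
use serde::{Deserialize, Serialize};
use serde_json::json;

#[derive(Debug, Serialize, Deserialize)]
struct LogSourceEntry {
    log_source_format: String,
    fields: Vec<String>,
}

fn main() {
    // An unknown legacy source string migrates to a structured entry with no known fields.
    let migrated = LogSourceEntry {
        log_source_format: "custom-app".to_string(),
        fields: Vec::new(),
    };
    let value = serde_json::to_value(&migrated).unwrap();
    assert_eq!(value, json!({ "log_source_format": "custom-app", "fields": [] }));
}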
Actionable comments posted: 1
🧹 Nitpick comments (2)
resources/formats.json (2)
1-22: Inconsistent Capture Group Syntax in "access_log" Definitions

The first regex pattern (lines 5–8) uses the syntax (?<timestamp>...) while the following ones use Python's named group style (?P<name>...). To avoid possible confusion or parsing issues across different regex engines, please consider standardizing the capture group syntax consistently across all entries (see the sketch after this list for how Rust's regex crate treats the two forms).
1-613: Overall Complexity and Maintainability Considerations
This file defines a very extensive list of log formats with intricate regex patterns. While the comprehensive nature of the patterns is commendable for parsing varied log types, please consider the following improvements:
- Documentation: Add inline comments or maintain an external documentation file summarizing each log format’s purpose and the intended meaning of its capture groups. This will assist future maintainers when updating or debugging these patterns.
- Modularization: As the file grows, it might be beneficial to split the formats into multiple files (e.g., grouping by log source) or add metadata that categorizes each entry. This could improve readability and ease future enhancements.
Overall, ensure that each regex pattern is well tested against representative log samples, especially given the mix in capture group conventions.
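On the capture-group point above: if the patterns are compiled with Rust's regex crate (an assumption about the implementation, the loader code is not shown in this thread), the two spellings behave identically, since (?P<name>...) has always been accepted and (?<name>...) is accepted from regex 1.9 onward. A minimal sketch with an illustrative pattern and log line:

// Both named-group spellings compile and capture the same way with regex >= 1.9.
// The pattern and log line are illustrative, not taken from formats.json.
use regex::Regex;
use std::collections::HashMap;

fn extract(pattern: &str, line: &str) -> Option<HashMap<String, String>> {
    let re = Regex::new(pattern).ok()?;
    let caps = re.captures(line)?;
    let map = re
        .capture_names()
        .flatten()
        .filter_map(|name| caps.name(name).map(|m| (name.to_string(), m.as_str().to_string())))
        .collect();
    Some(map)
}

fn main() {
    let line = "127.0.0.1 GET /index.html 200";
    let python_style = r"(?P<ip>\S+) (?P<method>\S+) (?P<path>\S+) (?P<status>\d+)";
    let bare_style = r"(?<ip>\S+) (?<method>\S+) (?<path>\S+) (?<status>\d+)";

    let a = extract(python_style, line).expect("P-style pattern should match");
    let b = extract(bare_style, line).expect("bare-style pattern should match");
    assert_eq!(a, b);
    println!("{a:?}");
}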
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
resources/formats.json (1 hunks)
src/handlers/http/ingest.rs (7 hunks)
src/handlers/http/modal/utils/ingest_utils.rs (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- src/handlers/http/modal/utils/ingest_utils.rs
- src/handlers/http/ingest.rs
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: coverage
{
  "name": "web_robot_log",
  "regex": [
    {
      "fields": ["ip", "timestamp", "method", "resource", "response", "bytes", "referrer", "request", "request-id", "useragent"]
    }
  ]
},
Missing Regex Pattern in "web_robot_log" Entry
The object for the "web_robot_log" (lines 519–525) contains only the "fields" array but does not provide a "pattern". Without a regex pattern, the parser will not be able to match any log lines for this format. Please add the appropriate regex pattern or, if intentionally omitted, include a comment explaining the rationale.
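If the loader models the pattern as optional, which is an assumption here since the loading code is not quoted in this thread, a fields-only entry like web_robot_log still deserializes and could act purely as a "these fields must already be present" check rather than an extraction rule. A minimal sketch with made-up struct names:

// Illustrative types only; the project's real format structs are not shown in this thread.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Pattern {
    #[serde(default)]
    pattern: Option<String>,
    fields: Vec<String>,
}

#[derive(Debug, Deserialize)]
struct Format {
    name: String,
    regex: Vec<Pattern>,
}

fn main() {
    let raw = r#"{
        "name": "web_robot_log",
        "regex": [
            { "fields": ["ip", "timestamp", "method", "resource", "response"] }
        ]
    }"#;
    let format: Format = serde_json::from_str(raw).unwrap();
    assert_eq!(format.name, "web_robot_log");
    assert!(format.regex[0].pattern.is_none());
    assert_eq!(format.regex[0].fields.len(), 5);
}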
Fixes #XXXX.
Description
This PR has:
Summary by CodeRabbit